# Multimodal Vision-Language Models

## InternVL3-8B-bf16

InternVL3-8B-bf16 is a vision-language model converted to the MLX format, supporting multilingual image-to-text tasks.

- License: Other
- Tags: Image-to-Text, Transformers, Other
- Publisher: mlx-community · Downloads: 96 · Likes: 1

## Llama-4-Scout-17B-16E-8bit

An MLX-format model converted from Meta's Llama-4-Scout-17B-16E, supporting multilingual and vision-language tasks.

- License: Other
- Tags: Image-to-Text, Transformers, Multilingual
- Publisher: mlx-community · Downloads: 252 · Likes: 0
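
Both MLX conversions above can be run locally on Apple Silicon. Below is a minimal inference sketch assuming the mlx-vlm package's load/generate API; the image path and prompt are placeholders, and the argument order of generate has varied across mlx-vlm versions, so check the package README for your installed release:

```python
# Minimal sketch: running an mlx-community VLM on Apple Silicon with mlx-vlm.
# Assumes `pip install mlx-vlm`; image path and prompt are placeholders.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/InternVL3-8B-bf16"
model, processor = load(model_path)  # downloads weights from the Hub on first use
config = load_config(model_path)

images = ["photo.jpg"]  # local path or URL
prompt = apply_chat_template(processor, config, "Describe this image.",
                             num_images=len(images))
print(generate(model, processor, prompt, images, verbose=False))
```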

## Qwen2.5VL-3B-VLM-R1-REC-500steps

A vision-language model based on Qwen2.5-VL-3B-Instruct, trained with the VLM-R1 reinforcement-learning framework for 500 steps and focused on referring expression comprehension (REC) tasks.

- Tags: Image-to-Text, Safetensors, English
- Publisher: omlab · Downloads: 976 · Likes: 22
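
For a REC-style query, the model is prompted for the location of a described object. A sketch assuming the standard Qwen2.5-VL interface in recent transformers (>= 4.49); the image path and prompt wording are illustrative, and the exact box output format is defined by the model's training:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

repo = "omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo)

# Referring expression comprehension: ask for the box of a described object.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Output the bounding box of the dog on the left."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```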

## Eagle2-9B

Eagle2 is a high-performance series of vision-language models focused on improving performance through optimized data strategies and training methods. Eagle2-9B is the largest model in the series, striking a good balance between performance and inference speed.

- Tags: Image-to-Text, Transformers, Other
- Publisher: KnutJaegersberg · Downloads: 15 · Likes: 4

## Eagle2-9B

Eagle2-9B is the latest vision-language model (VLM) released by NVIDIA, striking a strong balance between performance and inference speed. It is built on the Qwen2.5-7B-Instruct language model with a SigLIP+ConvNeXt vision encoder, and supports multilingual and multimodal tasks.

- Tags: Image-to-Text, Transformers, Other
- Publisher: nvidia · Downloads: 944 · Likes: 52
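
Eagle2 ships custom modeling code on the Hub, so loading it requires trust_remote_code. A hedged sketch follows; the chat-message format below is an assumption modeled on other Qwen2.5-based VLMs, and the authoritative inference recipe is the one on the model card:

```python
import torch
from transformers import AutoModel, AutoProcessor

repo = "nvidia/Eagle2-9B"

# Custom architecture: trust_remote_code pulls the modeling code from the Hub.
model = AutoModel.from_pretrained(
    repo, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

# ASSUMPTION: chat-template message format as used by Qwen2.5-based VLMs;
# consult the model card for the exact generation entry points.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
# From here, encode the image and call model.generate() per the model card.
```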

## ViTamin-XL-256px

ViTamin-XL-256px is a vision-language model based on the ViTamin architecture, designed for efficient visual feature extraction and multimodal tasks, with support for high-resolution image processing.

- License: MIT
- Tags: Feature Extraction, Transformers
- Publisher: jienengchen · Downloads: 655 · Likes: 1
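
Since the card emphasizes visual feature extraction, a minimal embedding sketch follows. The encode_image entry point and CLIP-style preprocessing are assumptions; the checkpoint relies on custom Hub code, so check the model card for the actual API:

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

repo = "jienengchen/ViTamin-XL-256px"

# ViTamin checkpoints load through custom code from the Hub.
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
processor = CLIPImageProcessor.from_pretrained(repo)  # ASSUMPTION: CLIP-style preprocessing

image = Image.open("photo.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# ASSUMPTION: a CLIP-style encode_image method exposed by the custom code.
with torch.no_grad():
    features = model.encode_image(pixel_values)
print(features.shape)  # one embedding vector per image
```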